R documentation: Sessa Empirical Estimator (SEE)
In arkh88/SEE_test: Sessa Empirical Estimator


Please refer to the detailed description of the algorithm below.

SEE(Dataset, Person_ID, Dispensing_date)

`Dataset`	An analytical dataset containing the prescriptions filled within the observational window by individuals included in the study population. The dataset should contain only two columns. The first column should contain the personal identifier of each patient included in the study population. The second column should contain the dispensing dates of the prescriptions filled by these individuals during the observational window. The two columns should be formatted as factor and Date, respectively.
`Person_ID`	Personal identifier of individuals included in the study population.
`Dispensing_date`	Dispensing date of filled prescriptions within the observational window for individuals included in the study population.
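As a minimal sketch of preparing such an input, the example below builds a two-column data frame with the identifier stored as a factor and the dispensing date stored as a Date. The object `raw` and its column names and values are hypothetical and only illustrate the expected shape.

```r
# Hypothetical raw extract; column names and values are illustrative only.
raw <- data.frame(
  id   = c("A", "A", "B", "B", "B"),
  date = c("2020-01-05", "2020-02-04", "2020-01-10", "2020-03-01", "2020-04-15"),
  stringsAsFactors = FALSE
)

# Keep exactly two columns: identifier as factor, dispensing date as Date.
dataset <- data.frame(
  person_id = factor(raw$id),
  disp_date = as.Date(raw$date, format = "%Y-%m-%d")
)

str(dataset)
```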

The SEE is a multi-step algorithm that aims to predict the duration of filled prescriptions when information on the true duration is not available. To do so, the SEE relies on individual-level information on the date of redemption of a medicinal product. It assumes that the duration of a filled prescription is associated with the temporal distance between subsequently filled prescriptions, as previously described for the waiting time distribution method. SEE clusters the temporal distances between filled prescriptions into k groups with similar patterns and predicts a duration for each group. The steps of the algorithm and the rationale for each step are described below:

1. Select the study population, define the observational window, and retrieve all the filled prescriptions within the observational window of each patient included in the study population (censoring at death or permanent emigration).

Rationale: A medication can be used with different dosage schemes for different indications, so the duration of a filled prescription may vary with the indication for which the medicinal product is used and with other factors such as disease severity, adherence, amount dispensed, and prescribed dosing. Selecting a well-defined study population is therefore crucial, to avoid mixing individuals who use the medication for different indications or who differ in disease severity, adherence patterns, amount dispensed, or prescribed dosing. SEE addresses this by having the user identify the population of interest and retrieving the filled prescriptions for this population within the observational window. This approach also addresses another problem commonly encountered in pharmacoepidemiology, namely that the duration of filled prescriptions can change across stages of pharmacological treatment. For example, a higher dosage may be needed at the start of treatment than at later stages, or vice versa. SEE requires the user to define an observational window that is representative of a specific stage of pharmacological treatment. An observational window can be a fixed period (e.g., 365 days) or an open-ended period from the index date (e.g., ending at death, emigration, or end of data coverage). Censoring for emigration, death, or exclusion criteria is also taken into account for fixed periods when defining the follow-up period in the observational window.
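Step 1 is carried out by the user before calling SEE. The sketch below shows one possible way to apply a fixed 365-day window with censoring in base R. It assumes that the index date is each person's first fill and that the window is inclusive of its last day; the objects `fills` and `censor` and their contents are hypothetical.

```r
# Hypothetical fills and censoring dates (death or emigration); NA = not censored.
fills <- data.frame(
  person_id = factor(c("A", "A", "A", "B", "B")),
  disp_date = as.Date(c("2020-01-01", "2020-06-01", "2021-03-01",
                        "2020-02-01", "2020-09-01"))
)
censor <- data.frame(
  person_id   = factor(c("A", "B")),
  censor_date = as.Date(c(NA, "2020-07-01"))
)

# Assumption: the index date is each person's first fill in the data.
index_date <- ave(as.numeric(fills$disp_date), fills$person_id, FUN = min)
window_end <- index_date + 365

# Shorten the window for individuals censored before the fixed period ends.
cens <- as.numeric(censor$censor_date[match(fills$person_id, censor$person_id)])
window_end <- pmin(window_end, cens, na.rm = TRUE)

# Keep only fills on or before the end of each person's window.
in_window <- fills[as.numeric(fills$disp_date) <= window_end, ]
```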

2. From the dates of redemption of consecutive prescriptions of all patients within the observational window, SEE computes the empirical cumulative distribution function (ECDF) of the temporal distances. To avoid including artificially long temporal distances introduced by early- and delayed-discontinuers, the SEE retains only the lowest 80 percent of the ECDF.

Rationale: Artificially long temporal distances between two subsequent filled prescriptions can arise from, for example, individuals who stop and re-start treatment (stoppers and re-starters) or individuals with poor adherence. Stoppers and re-starters are individuals who start a medication at one point in time and stop it for a long period (e.g., due to an adverse reaction to the pharmacological treatment) before starting the medication again. In SEE, this problem is addressed by cutting the last 20 percent of the ECDF, which contains the very long temporal distances between consecutive prescriptions.
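The cut in step 2 can be illustrated with base R's `ecdf()`. This is a sketch of one plausible reading of the step, not the package's internal code: it keeps the temporal distances whose ECDF value is at most 0.8. The data are hypothetical.

```r
# Illustrative fill dates for two hypothetical individuals.
fills <- data.frame(
  person_id = factor(c(rep("A", 6), rep("B", 5))),
  disp_date = as.Date("2020-01-01") +
    c(0, 30, 61, 90, 120, 480,   # A: long gap before the last fill
      0, 28, 60, 88, 118)        # B: regular refills
)
fills <- fills[order(fills$person_id, fills$disp_date), ]

# Temporal distance (days) between consecutive fills within each person.
gaps <- unlist(tapply(as.numeric(fills$disp_date), fills$person_id, diff))

# ECDF of all distances; keep those whose ECDF value is at most 0.8.
Fn <- ecdf(gaps)
kept <- gaps[Fn(gaps) <= 0.8]
```

In this toy example the 360-day gap produced by the re-starter falls in the upper tail and is dropped.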

3. For each individual in the study population, SEE randomly selects a pair of consecutive filled prescriptions in the observational window and computes the logarithm of the temporal distance, in days, between them.

Rationale: To reduce the skewness of the empirical distributions, the logarithmic transformation is applied to the temporal distances between consecutive filled prescriptions. By selecting a random pair of consecutive filled prescriptions, SEE removes the overrepresentation of individuals who fill prescriptions more often because a lower amount is prescribed. If all prescriptions were included, those individuals would contribute more to the estimation of the average temporal distances than individuals who fill prescriptions less often because a higher amount is prescribed.
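A minimal sketch of step 3 in base R follows. The helper `pick_log_gap()` and the data are hypothetical; individuals with fewer than two fills are assumed to yield `NA`, and same-day fills (a distance of 0 days) are not handled here.

```r
set.seed(42)  # reproducible random selection

fills <- data.frame(
  person_id = factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C")),
  disp_date = as.Date("2020-01-01") + c(0, 30, 60, 95, 0, 100, 0, 14, 28)
)
fills <- fills[order(fills$person_id, fills$disp_date), ]

# For one person's sorted dates (as day counts), pick one consecutive pair at
# random and return the log of the distance in days between the two fills.
pick_log_gap <- function(days) {
  gaps <- diff(days)
  if (length(gaps) == 0) return(NA_real_)   # fewer than two fills
  i <- sample.int(length(gaps), 1)
  log(gaps[i])
}

# One log-distance per individual, regardless of how many fills they have.
log_gap <- tapply(as.numeric(fills$disp_date), fills$person_id, pick_log_gap)
```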

4. The temporal distances for the randomly selected filled prescriptions undergo standardization (18). Subsequently, using the K-means algorithm, the temporal distances are clustered into k groups minimizing the sum of squares from the temporal distances to the assigned cluster centers (19), and the optimal number of clusters is selected using Silhouette Analysis.

Rationale: The standardized vector of temporal distances is used as input to the K-means algorithm. Standardization is performed because K-means is a distance-based algorithm and is therefore affected by the scale of a variable. Silhouette analysis measures the quality of a clustering by determining how well each object lies within its cluster, and it is used here to choose the number of clusters. A high average silhouette width indicates good clustering. Therefore, the SEE selects the number of clusters that yields the highest average silhouette width.
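Step 4 can be sketched with `scale()`, `kmeans()`, and `silhouette()` from the `cluster` package. The simulated data, the candidate range `2:5`, and `nstart = 25` are assumptions made for illustration; the settings used inside SEE may differ.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Simulated log-distances with two groups (roughly 30-day and 90-day refills).
log_gap <- c(log(rnorm(40, mean = 30, sd = 2)),
             log(rnorm(40, mean = 90, sd = 5)))

# Standardize to mean 0 and standard deviation 1.
z <- as.numeric(scale(log_gap))

# Average silhouette width for each candidate number of clusters.
candidate_k <- 2:5
avg_sil <- sapply(candidate_k, function(k) {
  km <- kmeans(z, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(z))[, "sil_width"])
})

# Refit with the k that gives the highest average silhouette width.
best_k <- candidate_k[which.max(avg_sil)]
fit <- kmeans(z, centers = best_k, nstart = 25)
```

Silhouette widths are undefined for a single cluster, which is why the candidate range starts at 2.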

5. For each cluster, SEE builds the probability density function (PDF) of the temporal distances, finds the median temporal distance, and exponentiates it. These median values are used as the predicted durations of the prescriptions in each cluster.

Rationale: This step is performed to obtain the predicted durations of filled prescriptions for each cluster (the median temporal distance). Exponentiation is performed to remove the logarithmic transformation previously applied to reduce the skewness of empirical distributions.
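As a simplified sketch of step 5, the example below takes the sample median of the log-distances in each cluster directly (rather than building a density first, which is an assumption of this sketch) and back-transforms it to days. The values and cluster labels are hypothetical.

```r
# Hypothetical log-distances and cluster labels from the clustering step.
log_gap    <- log(c(28, 30, 31, 29, 30, 88, 90, 92, 91, 89))
cluster_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)

# Median log-distance per cluster, back-transformed to days.
predicted_duration <- exp(tapply(log_gap, cluster_id, median))
predicted_duration
```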

6. Finally, SEE computes the predicted end of supply for each filled prescription using the predicted durations.

Rationale: Based on the predicted durations of the filled prescriptions, the predicted end of each supply is computed as the date of redemption plus the predicted duration.
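The date arithmetic in step 6 is illustrated below. The sketch assumes that each fill already carries a cluster label and that a predicted duration per cluster is available; `fills`, `cluster_id`, and the duration values are hypothetical.

```r
fills <- data.frame(
  person_id  = factor(c("A", "A", "B")),
  disp_date  = as.Date(c("2020-01-01", "2020-02-03", "2020-01-15")),
  cluster_id = c(1, 1, 2)
)
predicted_duration <- c("1" = 30, "2" = 90)  # days, hypothetical values

# Look up the predicted duration for each fill, then add it to the fill date.
fills$duration      <- unname(predicted_duration[as.character(fills$cluster_id)])
fills$end_of_supply <- fills$disp_date + fills$duration
```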

SEE includes a simulated dataset named "sim_disp_apixaban", which contains simulated filled prescriptions of apixaban for 19774 individuals. The simulated dataset contains 7 columns: 1) person_id: Person ID for each individual in the study population; 2) disp_date: Date on which the individual filled a prescription of apixaban; 3) num_pack: Number of packages filled; 4) atc_code: Anatomical Therapeutic Chemical classification code of apixaban; 5) str_text: Strength of the redeemed posological units; 6) units_pack: Total number of posological units per package; 7) ddd_pack: Defined Daily Dose contained in each package.

Dr. Maurizio Sessa

University of Copenhagen, Faculty of Health Science, Department of Drug Design and Pharmacology

Contacts: Jagtvej 160, 2100 Copenhagen K, DENMARK

E-mail: maurizio.sessa@sund.ku.dk

Dr. Abdul Rauf Khan

Technical University of Denmark, Department of Applied Mathematics and Computer Science, Statistics and Data Analysis

Contacts: Richard Petersens Plads, 324, 210, 2800 Kgs. Lyngby, DENMARK

E-mail: arkh@dtu.dk

Pottegård A, Hallas J. Assigning exposure duration to single prescriptions by use of the waiting time distribution. Pharmacoepidemiology and Drug Safety. 2013;22(8):803-9.

Støvring H, Pottegård A, Hallas J. Determining prescription durations based on the parametric waiting time distribution. Pharmacoepidemiology and Drug Safety. 2016;25(12):1451-9.

Thrane JM, Støvring H, Hellfritzsch M, Hallas J, Pottegård A. Empirical validation of the reverse parametric waiting time distribution and standard methods to estimate prescription durations for warfarin. Pharmacoepidemiology and Drug Safety. 2018;27(9):1011-8.

Bødkergaard K, Selmer RM, Hallas J, Kjerpeseth LJ, Pottegård A, Skovlund E, et al. Using the waiting time distribution with random index dates to estimate prescription durations in the presence of seasonal stockpiling. Pharmacoepidemiology and Drug Safety. 2020;29(9):1072-8.

Hallas J, Gaist D, Bjerrum L. The waiting time distribution as a graphical approach to epidemiologic measures of drug utilization. Epidemiology. 1997:666-70.

De Amorim RC, Hennig C. Recovering the number of clusters in data sets with noise features using feature rescaling factors. Information Sciences. 2015;324:126-45.

Steinley D. K-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology. 2006;59(1):1-34.

##library(SEE)
##library(dplyr)

##Using an existing dataset
##data("sim_disp_apixaban")
##sim_disp_apixaban <- sim_disp_apixaban[, c("person_id", "disp_date")]
##sim_disp_apixaban <- SEE(sim_disp_apixaban, person_id, disp_date)

##Using another simulated dataset
##Draw 1000 weekly dispensing dates and assign 100 of them to each of 10 individuals
##eksd <- sample(seq(as.Date("1995-01-01"), as.Date("2020-12-01"), by = "week"), 1000)
##pnr <- factor(paste0("PID", rep(1:10, each = 100)))
##n <- length(eksd)
##atc <- "B01XX01"
##packsize <- rep(c(25, 50, 100), length.out = n)
##apk <- round(runif(n, min = 1, max = 2))
##strnum <- rep(c(25, 50, 75, 100, 150, 200), length.out = n)

##Build the data frame directly so that eksd stays a Date and pnr stays a factor
##mydf <- data.frame(pnr, eksd, atc, packsize, apk, strnum)

##Keep only the two columns SEE expects: person ID and dispensing date
##mydf <- mydf[, c("pnr", "eksd")]

##mydf <- SEE(mydf, pnr, eksd)